feat: add xatu-cbt seed data generator with S3 upload support #70
Merged
- introduce new CLI commands: `xcli lab xatu-cbt` and `xcli lab xatu-cbt generate-seed-data`
- add AWS SDK v2 dependencies for S3/R2 upload
- implement interactive and scripted modes for data extraction
- support filtering by model, network, spec, range, and custom SQL
- auto-generate xatu-cbt test YAML templates after extraction
- upload parquet files to ethpandaops R2 bucket with overwrite check
- document usage and S3 credential setup in README
Introduce a new CLI sub-command that automates the creation of complete test YAML files for transformation models. The command resolves the full dependency tree, queries external ClickHouse for available data ranges, generates seed parquet files for every external dependency, and optionally uploads them to S3. It also supports AI-generated assertions via Claude.
- New command: `xcli lab xatu-cbt generate-transformation-test`
- Dependency tree resolution with cycle detection
- Range intersection across all external models
- Batch parquet generation and S3 upload
- AI assertion generation using Claude CLI
- Interactive and scripted modes
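The dependency resolution and range intersection steps listed above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Model` and `Range` types, the function names, and the use of integer range bounds are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Model is a hypothetical node in the dependency graph (not the PR's type).
type Model struct {
	Name     string
	External bool     // external models are the ones that need seed data
	Deps     []string // names of the models this one reads from
}

// ResolveExternal walks the dependency tree of root depth-first and returns
// the sorted names of every external model reachable from it. It returns an
// error for an unknown model or a dependency cycle.
func ResolveExternal(models map[string]Model, root string) ([]string, error) {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int)
	external := make(map[string]bool)

	var visit func(name string) error
	visit = func(name string) error {
		m, ok := models[name]
		if !ok {
			return fmt.Errorf("unknown model %q", name)
		}
		switch state[name] {
		case inProgress:
			// We re-entered a model that is still on the DFS stack.
			return fmt.Errorf("dependency cycle detected at %q", name)
		case done:
			return nil
		}
		state[name] = inProgress
		if m.External {
			external[name] = true
		}
		for _, dep := range m.Deps {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		return nil
	}
	if err := visit(root); err != nil {
		return nil, err
	}

	out := make([]string, 0, len(external))
	for name := range external {
		out = append(out, name)
	}
	sort.Strings(out)
	return out, nil
}

// Range is a closed interval [Min, Max] over an ordered range column.
type Range struct{ Min, Max int64 }

// Intersect returns the overlap of all ranges. The boolean is false when the
// input is empty or the ranges share no common point.
func Intersect(ranges []Range) (Range, bool) {
	if len(ranges) == 0 {
		return Range{}, false
	}
	out := ranges[0]
	for _, r := range ranges[1:] {
		if r.Min > out.Min {
			out.Min = r.Min
		}
		if r.Max < out.Max {
			out.Max = r.Max
		}
	}
	return out, out.Min <= out.Max
}
```

Tracking an "in progress" state separately from "done" is what distinguishes a genuine cycle from a model that is simply shared by two branches of the tree.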
…load
perf(seeddata): replace MIN/MAX with ORDER BY LIMIT 1 for faster range queries
refactor(seeddata): split range query into two single-value queries
chore(seeddata): reduce query timeout from 2m to 30s
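A rough sketch of the query shape these commits describe: one `ORDER BY ... LIMIT 1` query per bound instead of a combined `MIN`/`MAX`, each run under a 30-second timeout. The function names, the `QueryOne` callback, the integer result type, and the choice to apply the timeout per query are assumptions for illustration. Identifiers are interpolated without quoting here, so a real implementation would need to validate them first.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// queryTimeout mirrors the 30s value named in the commit message.
const queryTimeout = 30 * time.Second

// RangeQueries builds two single-value queries that fetch the lowest and
// highest value of rangeColumn. table and rangeColumn are assumed to be
// trusted identifiers.
func RangeQueries(table, rangeColumn string) (minQuery, maxQuery string) {
	minQuery = fmt.Sprintf("SELECT %s FROM %s ORDER BY %s ASC LIMIT 1", rangeColumn, table, rangeColumn)
	maxQuery = fmt.Sprintf("SELECT %s FROM %s ORDER BY %s DESC LIMIT 1", rangeColumn, table, rangeColumn)
	return minQuery, maxQuery
}

// QueryOne is a hypothetical callback that runs a query returning one value.
type QueryOne func(ctx context.Context, sql string) (int64, error)

// FetchRange runs both queries in sequence, each with its own timeout.
func FetchRange(ctx context.Context, run QueryOne, table, rangeColumn string) (lo, hi int64, err error) {
	minQuery, maxQuery := RangeQueries(table, rangeColumn)

	minCtx, cancelMin := context.WithTimeout(ctx, queryTimeout)
	defer cancelMin()
	if lo, err = run(minCtx, minQuery); err != nil {
		return 0, 0, fmt.Errorf("min query: %w", err)
	}

	maxCtx, cancelMax := context.WithTimeout(ctx, queryTimeout)
	defer cancelMax()
	if hi, err = run(maxCtx, maxQuery); err != nil {
		return 0, 0, fmt.Errorf("max query: %w", err)
	}
	return lo, hi, nil
}
```

Splitting the bounds into separate queries also means a failure or timeout can be attributed to one specific bound.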
Allow callers to force a specific range column instead of using per-model detection. This enables consistent range queries across models when needed.
…nctions
This adds sanitization of IPv4 and IPv6 columns during seed data generation. Each address is hashed with a shared salt, so anonymization stays consistent across related data sets while the original address type (IPv4, IPv6, or IPv4-mapped IPv6) is preserved. The change adds salt generation, ClickHouse schema introspection (`DESCRIBE TABLE`), and dynamic SQL query construction that replaces raw column selection with sanitization expressions in the `lab_xatu_cbt_generate_seed_data` and `lab_xatu_cbt_generate_transformation` commands.
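To illustrate the idea of salted, type-preserving hashing, here is a small Go sketch. It is not the sanitization expression this PR generates, which is built as ClickHouse SQL. The function name `SanitizeIP`, the use of SHA-256, and the choice to truncate the digest to the address width are assumptions.

```go
package main

import (
	"crypto/sha256"
	"net/netip"
)

// SanitizeIP replaces an address with a salted hash of itself while keeping
// its family: IPv4 stays IPv4, IPv6 stays IPv6, and an IPv4-mapped IPv6
// address stays IPv4-mapped. The same salt and input always give the same
// output, so joins across related tables still line up.
func SanitizeIP(salt, raw string) (string, error) {
	addr, err := netip.ParseAddr(raw)
	if err != nil {
		return "", err
	}

	h := sha256.New()
	h.Write([]byte(salt))

	switch {
	case addr.Is4():
		b := addr.As4()
		h.Write(b[:])
		var out [4]byte
		copy(out[:], h.Sum(nil)) // first 4 bytes of the digest
		return netip.AddrFrom4(out).String(), nil

	case addr.Is4In6():
		// Hash only the embedded IPv4 part, then re-apply the ::ffff: prefix.
		b := addr.Unmap().As4()
		h.Write(b[:])
		var out [16]byte
		out[10], out[11] = 0xff, 0xff
		copy(out[12:], h.Sum(nil)[:4])
		return netip.AddrFrom16(out).String(), nil

	default:
		b := addr.As16()
		h.Write(b[:])
		var out [16]byte
		copy(out[:], h.Sum(nil)) // first 16 bytes of the digest
		return netip.AddrFrom16(out).String(), nil
	}
}
```

In this sketch, an IPv4 address and its IPv4-mapped IPv6 form hash to matching values, so `192.0.2.1` and `::ffff:192.0.2.1` remain recognizably the same host after sanitization. Whether the PR's SQL expressions share that property is not stated in the commit message.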
… transformation
This change updates the seed data generation and transformation commands to report which IP columns were sanitized during the process, improving user visibility into data masking operations.
feat: Sanitize IPs for parquet file uploads
…est generation
This change introduces predefined time range presets (e.g., "Last 5 minutes", "Last 1 hour") to simplify selecting the time window for seed data generation during transformation tests. It also accounts for ingestion lag when calculating the effective maximum time.
…ormation generation
feat: Defined ranges
…r-pt2
feat: add generate-transformation-test command for xatu-cbt
- Remove rangePresets, ingestionLagBuffer and manual range prompts
- Add --duration flag and new seeddata/discovery.go module
- Integrate Claude-based range strategy generation with fallback heuristic
- Validate data availability per model before generation
- Streamline UX: single duration prompt, AI summary, confirmation flow
…L filter analysis
- add support for entity/dimension tables (no time range) via intervalType
- read intermediate transformation SQL to extract WHERE clause filters
- extend discovery prompt to include correlation filters for dimension tables
- add FilterSQL and CorrelationFilter to TableRangeStrategy for precise filtering
- improve fallback discovery to handle entity models and missing ranges
- normalize YAML field names and fix unquoted datetime values in Claude responses
- extend QueryRowCount and GenerateOptions to accept additional SQL filters
- add S3 Cache-Control: no-cache header for fresh seed data downloads
feat: enhance seed-data discovery with dimension table support and SQL filter analysis
feat: replace manual range selection with AI-driven discovery
- add detection for common field name typos like "primaryrangeType"
- expand normalizeDiscoveryYAMLFields to handle PascalCase, snake_case and other variations Claude might output
- enhance discovery prompt with stricter formatting rules and examples
- include full YAML content in error messages for easier debugging
- log when YAML normalization actually changes field names
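One way to handle the casing variations described above is to fold every key to a canonical comparison form and map it back to the expected spelling. The sketch below is an assumption about the approach, not the body of `normalizeDiscoveryYAMLFields`. The canonical field names in `canonicalFields` and the helper names are hypothetical.

```go
package main

import "strings"

// canonicalFields lists hypothetical canonical YAML keys. The real set of
// keys used by the discovery response is not shown in this PR description.
var canonicalFields = []string{
	"primaryRangeType",
	"intervalType",
	"filterSql",
	"correlationFilter",
}

// separators strips the characters that differ between naming styles.
var separators = strings.NewReplacer("_", "", "-", "", " ", "")

// foldKey lowercases a key and removes separators, so "PrimaryRangeType",
// "primary_range_type" and "primaryrangeType" all compare equal.
func foldKey(s string) string {
	return separators.Replace(strings.ToLower(s))
}

// NormalizeFieldName maps a key to its canonical spelling. The boolean
// reports whether the name actually changed, which lets the caller log only
// real rewrites. Unknown keys are returned untouched.
func NormalizeFieldName(name string) (string, bool) {
	folded := foldKey(name)
	for _, canonical := range canonicalFields {
		if foldKey(canonical) == folded {
			return canonical, canonical != name
		}
	}
	return name, false
}
```

Returning the "changed" flag alongside the name keeps the logging decision with the caller, in line with the last bullet above.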